Dynomotion

Group: DynoMotion Message: 13294 From: Hardy Family Date: 5/22/2016
Subject: Improving bit banging speed for SPI
Hi Tom,

I know this subject has been covered before, but I was wondering if I'm missing something.  The following loop outputs a 2MHz square wave, which seems to be rather slow considering it is doing practically nothing except toggle the FPGA location.  I'm using cl6x with options -mv6710 -ml3 -mu -O2 --opt_for_space

unsigned short spi_rw(register unsigned short mosi)
{
    unsigned i;
    register unsigned short miso = 0;   /* initialized so the return value is defined */
   
    for (i = 0; i < 16; ++i) {
        *sclk_fpgaset = sclk_hi;
        *sclk_fpgaclr = sclk_lo;
    }
    return miso;
}


sclk_fpgaset/clr are macros which are constant pointers to the appropriate memory mapped locations, and sclk_lo/hi are constants.

If I actually try to read/write data, then it slows down to about 800kHz clock rate:

unsigned short spi_rw(register unsigned short mosi)
{
    unsigned i;
    register unsigned short miso = 0;   /* initialized so the first shift reads a defined value */
   
    for (i = 0; i < 16; ++i) {
        *sclk_fpgaset = sclk_hi;
        //if (mosi & 0x8000)
        //    *mosi_fpgaset = mosi_hi;
        //else
        //    *mosi_fpgaclr = mosi_lo;
       
        // branch free...
        *(volatile unsigned char *)(0x91000452 ^ (mosi>>9 & 0x40)) = 0xF7 ^ ((unsigned char)0 - (unsigned char)(mosi>>15));
        mosi <<= 1;
       
        // read...
        miso = (miso << 1) | ((*miso_fpgard >> 2) & 1);
        *sclk_fpgaclr = sclk_lo;
    }
    return miso;
}


So I can't work out why each read/write to an FPGA pin is taking 250ns.  Are there a lot of wait states added to FPGA accesses in that 0x90000000 block?  The DSP has a 5ns cycle time, and it can stall up to 6 cycles if the pipeline requires, so the slowest instruction should be 30ns.  So it looks like there is about 200ns of overhead in accessing the FPGA?

Note that that horrible branch-free code didn't make a noticeable difference compared with the more obvious code, so I don't think it's the code which is slow.

I'm trying to read/write a 16-bit value every tick (90us), but it's taking about 20us, which is roughly 22% of the CPU.  It would be nice to be able to cut this down to 10% CPU.
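
As a rough check on those numbers, here is a minimal sketch of the budget arithmetic in C. The function names are made up for illustration; the inputs (16 bits, 800kHz, 90us tick, 10% target) are the figures quoted above.

```c
#include <assert.h>

/* Time for one transfer, in nanoseconds, at a given SCLK rate in kHz. */
unsigned transfer_ns(unsigned bits, unsigned sclk_khz)
{
    assert(sclk_khz > 0);
    return bits * 1000000u / sclk_khz;
}

/* CPU load in tenths of a percent for one transfer per tick. */
unsigned load_permille(unsigned busy_ns, unsigned tick_ns)
{
    assert(tick_ns > 0);
    return busy_ns * 1000u / tick_ns;
}

/* Lowest SCLK rate in kHz (rounded up) that fits the bits into a time budget. */
unsigned min_sclk_khz(unsigned bits, unsigned budget_ns)
{
    assert(budget_ns > 0);
    return (bits * 1000000u + budget_ns - 1u) / budget_ns;
}
```

With these, transfer_ns(16, 800) gives 20000ns, load_permille(20000, 90000) gives 222 (about 22%), and min_sclk_khz(16, 9000) gives 1778kHz for a 10% budget. Even the bare 2MHz toggle loop would take 8us for 16 bits, about 8.9% of the tick, so there is very little headroom once data reads and writes are added.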

Regards,
SJH


Group: DynoMotion Message: 13295 From: Tom Kerekes Date: 5/23/2016
Subject: Re: Improving bit banging speed for SPI

Hi SJH,

I don't think you are missing anything.  Off-chip accesses are horribly slow compared to on-chip activity.  I find it ironic that the DSP can theoretically do 100+ 32-bit floating point operations in the time it takes to set a bit in the FPGA.

The EMIF (External Memory Interface) runs at 100MHz, not 200MHz.  We set the Asynchronous Write timing for 3 Write Setup cycles, 15 Write Strobe cycles, and 1 Hold cycle.  That is 19 EMIF cycles at 10ns each, which should be 190ns.  And there are always extra penalties for bus turnaround and whatnot.
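
The timing arithmetic above can be sketched in a few lines of C. This is only an illustration of the calculation; the function names are hypothetical, and the 5ns CPU cycle is the figure from the original question.

```c
#include <assert.h>

/* Duration of one asynchronous write in ns: setup + strobe + hold EMIF cycles. */
unsigned emif_write_ns(unsigned setup, unsigned strobe, unsigned hold,
                       unsigned emif_mhz)
{
    assert(emif_mhz > 0);
    return (setup + strobe + hold) * 1000u / emif_mhz;
}

/* The same duration expressed in CPU cycles of cpu_cycle_ns each. */
unsigned emif_write_cpu_cycles(unsigned write_ns, unsigned cpu_cycle_ns)
{
    assert(cpu_cycle_ns > 0);
    return write_ns / cpu_cycle_ns;
}
```

For the settings quoted, emif_write_ns(3, 15, 1, 100) is 190ns, or 38 CPU cycles at 5ns per cycle, before any turnaround penalties are counted.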

I have found that external writes don't cost much if they can be spread out.  It's like sending data through a pipeline.  Once a write has been sent, the DSP can go on.  However, if the next write needs to be sent while the pipeline isn't empty, the DSP stalls.  Reads always stall because the DSP must wait to receive the data.
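
To make the effect of spreading writes out concrete, here is a toy cost model in C. It is a simplification under stated assumptions, not a description of the real EMIF: it assumes a single posted write in flight, a fixed time per write, and no reads. The function name and parameters are made up for illustration.

```c
#include <assert.h>

/* Toy model: each step does work_ns[i] of on-chip work, then posts one
   external write lasting write_ns. The CPU only stalls if the previous
   write is still in flight when the next one is issued.
   Returns the total time the CPU spends stalled, in ns. */
unsigned stall_ns(const unsigned *work_ns, unsigned n, unsigned write_ns)
{
    unsigned now = 0, bus_free_at = 0, stalled = 0, i;

    assert(work_ns != 0 || n == 0);
    for (i = 0; i < n; ++i) {
        now += work_ns[i];                 /* on-chip work before the write */
        if (now < bus_free_at) {           /* previous write still draining */
            stalled += bus_free_at - now;
            now = bus_free_at;
        }
        bus_free_at = now + write_ns;      /* post the write and carry on */
    }
    return stalled;
}
```

In this model, four back-to-back 190ns writes stall the CPU for 570ns in total, while four writes separated by 200ns of on-chip work do not stall at all. A bit-banged SPI loop cannot benefit much from this, since every bit needs a read, and reads always wait.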

I doubt it will help much, but you should probably look at the assembly code to see what is coded by the compiler.  Maybe try optimization level 3 with -O3 and remove the opt_for_space option?  Or is that removing it?

Regards

TK


Group: DynoMotion Message: 13298 From: Hardy Family Date: 5/23/2016
Subject: Re: Improving bit banging speed for SPI
Thanks for the reply.  I just realized that the SCLK rising edge and MOSI writes can be done at the same time, since they are both on the same 8-bit FPGA register.  That should speed it up by about 30%.
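
A minimal sketch of the combined write, under several assumptions. The MOSI (bit 3) and MISO (bit 2) positions follow the earlier code; SCLK on bit 1 is a guess. It also assumes the set register takes an OR mask and the clear register takes an AND mask, and that MOSI can be cleared together with SCLK on the falling edge. The spi_port hooks and the loopback model are hypothetical and only there so the loop can be run off-target; on the DSP they would be plain volatile stores and loads.

```c
#include <assert.h>

#define SCLK_BIT 0x02u   /* assumed position of SCLK in the shared register */
#define MISO_BIT 0x04u   /* bit 2, as read in the original loop */
#define MOSI_BIT 0x08u   /* bit 3, as written in the original loop */

/* Hypothetical access hooks standing in for the memory-mapped registers. */
typedef struct {
    void (*set_bits)(unsigned char or_mask);    /* bit-set register */
    void (*clr_bits)(unsigned char and_mask);   /* bit-clear register */
    unsigned char (*read_pins)(void);           /* pin read register */
} spi_port;

/* Three external accesses per bit instead of four. */
unsigned short spi_rw_combined(const spi_port *p, unsigned short mosi)
{
    unsigned i;
    unsigned short miso = 0;

    assert(p && p->set_bits && p->clr_bits && p->read_pins);
    for (i = 0; i < 16; ++i) {
        /* One write: SCLK high, plus MOSI high if the top bit is 1. */
        p->set_bits((unsigned char)(SCLK_BIT | ((mosi >> 15) << 3)));
        mosi = (unsigned short)(mosi << 1);
        miso = (unsigned short)((miso << 1) | ((p->read_pins() >> 2) & 1u));
        /* One write: SCLK and MOSI both low, ready for the next bit. */
        p->clr_bits((unsigned char)~(SCLK_BIT | MOSI_BIT));
    }
    return miso;
}

/* Loopback model for off-target testing: MISO mirrors MOSI. */
unsigned char sim_pins;
unsigned sim_writes;

void sim_set(unsigned char or_mask)  { sim_pins |= or_mask;  ++sim_writes; }
void sim_clr(unsigned char and_mask) { sim_pins &= and_mask; ++sim_writes; }
unsigned char sim_read(void)
{
    return (unsigned char)((sim_pins & ~MISO_BIT) | ((sim_pins & MOSI_BIT) >> 1));
}
```

With the loopback model, sending 0xA5C3 reads back 0xA5C3 using 32 writes and 16 reads, compared with 48 writes and 16 reads when SCLK and MOSI are written separately. Whether the slave tolerates MOSI changing on the same edge as SCLK depends on the device's SPI mode.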

I've been trying it out and, in practice, it seems to be OK.  It's certainly a lot better than the old bit-banged I2C we used on our first board.  If you're ever feeling really energetic, it would be cool to have a simple SPI-like shift register on the FPGA: read 8 bits, then write 8 bits and it shifts out automatically (and reads in the next 8 bits).  Even if you have to wait for a known time interval before sending the next byte, that would be great.  Hint: Use I/O bits 33, 34, 35 for SCLK, MISO and MOSI  :-)

Regards,
SJH

